# CDSP: AN APPLICATION-SPECIFIC DIGITAL SIGNAL PROCESSOR FOR THIRD GENERATION WIRELESS COMMUNICATIONS

Po-Chih Tseng, Chi-Kuang Chen, and Liang-Gee Chen

DSP/IC Design Lab, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

# ABSTRACT

In this paper, we present an application-specific Digital Signal Processor (DSP) for third generation wireless communications. The processor architecture and instruction set of the proposed DSP are specially designed for the WCDMA system, with enhanced support for Viterbi decoding and complex arithmetic. These key features allow the proposed DSP to consume far fewer processor MIPS and therefore to outperform prior art on several crucial operations of wireless applications. A prototype chip has been implemented using a cell-based design approach in TSMC 0.35 µm 1P4M CMOS technology. At a 3.3 V supply voltage, it can operate at up to 40 MHz.

**Keywords:** application-specific Digital Signal Processor, third generation wireless communications.

# **1. INTRODUCTION**

Wireless communication products have become increasingly popular in recent years, and the demand for data transmission over the mobile radio interface is also growing rapidly. First generation wireless systems use analog techniques to provide cellular phone service, mainly for voice. Second generation (2G) systems, such as GSM or IS-95, are based on modern digital technology and provide mobile services for both voice and data. The data rates supported by 2G systems range from 9.6 kb/s to 14.4 kb/s. Most third generation (3G) mobile communication proposals, however, adopt Wide-band Code Division Multiple Access (WCDMA) [1][2] to provide wireless real-time multimedia services, which demand much higher data rates of up to 2 Mb/s. As a result, 3G systems will require significantly higher signal processing capability than 2G systems.

Programmable Digital Signal Processors (DSPs) have been widely applied in wireless systems to provide the necessary system flexibility and upgradability. To meet the increasing specifications of wireless systems, several application-specific DSPs targeting wireless communications [3][4][5][6] have been proposed. These DSPs usually possess moderate processing capabilities suited to recent 2G wireless systems; they are applied to specific areas such as speech codecs and may be capable of implementing several key operations in wireless systems. However, since 3G systems require much more processing capability, these communication DSPs originally applied in 2G systems may no longer offer sufficient computational capability for 3G systems.

In this paper, we present an application-specific Digital Signal Processor, called CDSP, which is suitable for 3G wireless communications. The CDSP is a computationally more efficient DSP whose processor architecture and instruction set are specially designed for the WCDMA system. These key features allow the CDSP to consume far fewer processor MIPS and therefore to outperform prior art [3][4][5][6] on several crucial operations of wireless applications. In the remainder of this paper, we first illustrate the suggested WCDMA baseband processor and then describe the detailed architecture of CDSP in section 2. In section 3, several special instructions for the key operations of the WCDMA system are discussed. Section 4 compares the performance of CDSP with that of several communication DSPs on some key operations of wireless applications. Chip implementation issues are discussed in section 5. Finally, a brief summary concludes the paper.

# **2. PROCESSOR ARCHITECTURE**

### 2.1 WCDMA Baseband Processor

Most implementations of a CDMA baseband processor include a correlator array and a DSP. Fig. 1 shows the suggested WCDMA baseband processor, which includes a DSP, a correlator array, a system controller, and several code generators. The system controller controls the tasks of baseband signal processing, provides the code seeds to the code generators, and handshakes with the DSP. The code generators receive the code seeds and generate the corresponding codes for the correlator array. Chip-rate processing is handled by the correlator array, and symbol-rate processing is handled by the DSP.



Figure 1: Suggested WCDMA baseband processor

### 2.2 CDSP Architecture

Before designing the DSP architecture, we first simulated the WCDMA system. From the simulation, we extracted several key operations that can be executed well by the DSP. The CDSP is specially designed for symbol-rate I/Q channel data processing, such as channel estimation, RAKE combining, the Viterbi algorithm, and the FIR filtering that is widely used in communication systems. The block diagram of CDSP is shown in Fig. 2. Detailed design issues are discussed in the following subsections.

#### A. Architecture Overview

A modified Harvard architecture with one program memory and two two-port data memories is used in CDSP. Each memory has a 16-bit address space. In a single clock cycle, one instruction fetch from program memory, two operand reads from the two two-port data memories, and two data writes back to the two data memories can be performed simultaneously. This architecture provides high memory bandwidth while the DSP is operating. The data memories use a 16-bit word width, and the program memory uses a 28-bit word width. Each data memory has its own address generator (AG) that supplies its addresses.

The pipeline stages of CDSP follow the flow of data. There are five pipeline stages in CDSP: instruction fetch, instruction decode, operand read, execution, and write back, as shown in Fig. 3. One instruction execution time is then equivalent to one instruction cycle. CDSP also has five interrupt vectors and a general parallel I/O port for exchanging data with outside components.
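The timing implied by a five-stage pipeline can be sketched in a few lines of Python. This is only an idealized illustration: it assumes no stalls, branches, or interrupts, and the stage abbreviations (`IF`, `ID`, `OR`, `EX`, `WB`) and function names are our own labels rather than names used by CDSP.

```python
# Idealized five-stage pipeline model (assumes no stalls or hazards).
STAGES = ["IF", "ID", "OR", "EX", "WB"]  # fetch, decode, operand read, execute, write back


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to retire all instructions once the pipeline fills."""
    if num_instructions <= 0:
        return 0
    return num_stages + num_instructions - 1


def occupancy(num_instructions):
    """For each cycle, map stage name -> index of the instruction in that stage."""
    rows = []
    for cycle in range(total_cycles(num_instructions)):
        row = {}
        for depth, name in enumerate(STAGES):
            instr = cycle - depth  # instruction that entered the pipe `depth` cycles ago
            if 0 <= instr < num_instructions:
                row[name] = instr
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cycle, row in enumerate(occupancy(3)):
        print(cycle, row)
```

Under these assumptions, a long instruction sequence approaches a throughput of one instruction per clock cycle, since the four-cycle fill latency is paid only once.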

#### B. Special Instructions with SIMD Computational Model

Based on the WCDMA simulation results, we design several special instructions for the key operations. We use the SIMD (Single Instruction Stream, Multiple Data Streams) computational model so that a single instruction drives several functional units of the datapath in parallel. Therefore, CDSP can efficiently perform many key operations of the WCDMA system.



Figure 2: Block diagram of CDSP

| Instruction Fetch | Instruction Decode | Operand Read | Execution | Write Back |
|-------------------|--------------------|--------------|-----------|------------|

Figure 3: Five-stage pipeline in CDSP

#### C. Data Path with Sub-Word Parallelism (SWP)

The key features of CDSP are its four datapath units: ALU (Arithmetic Logic Unit), MAC (Multiply-Accumulator), CMP (Comparator), and SFT (Barrel Shifter). The inputs of the ALU, SFT, CMP, and the accumulator of the MAC are 40 bits wide, and the inputs of the MAC multipliers are 16 bits wide. The outputs of the datapath can be stored into one of the two 40-bit accumulators (D0 or D1) or into the two data memories. All four datapath units complete their operations in a single clock cycle.

According to our simulation results for the WCDMA system, a 6-bit word length is sufficient for the correlator output [7]. Therefore, a normal 16-bit DSP datapath can be split into two 8-bit halves, one for the I channel and one for the Q channel, as shown in Fig. 4. With this SWP datapath architecture, symbol-rate I/Q channel data processing can be accelerated efficiently.



Figure 4: SWP data format in CDSP
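The packed I/Q format can be illustrated with a short Python sketch. The byte order is an assumption made for illustration (I in the upper byte, Q in the lower byte); the text does not state which half carries which channel, and the helper names are hypothetical.

```python
from typing import Tuple


def to_signed8(x: int) -> int:
    """Interpret the low 8 bits of x as a two's-complement value."""
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x


def pack_iq(i: int, q: int) -> int:
    """Pack signed 8-bit I and Q samples into one 16-bit word.

    Assumption: I occupies bits 15..8 and Q occupies bits 7..0.
    """
    if not (-128 <= i <= 127 and -128 <= q <= 127):
        raise ValueError("I and Q must each fit in a signed 8-bit value")
    return ((i & 0xFF) << 8) | (q & 0xFF)


def unpack_iq(word: int) -> Tuple[int, int]:
    """Recover the signed (I, Q) pair from a packed 16-bit word."""
    return to_signed8(word >> 8), to_signed8(word)


if __name__ == "__main__":
    word = pack_iq(-3, 5)
    print(hex(word), unpack_iq(word))
```

A 6-bit correlator output lies in the range -32 to 31, so each sample fits in its 8-bit half with two bits of headroom.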

For example, the SWP architecture of the MAC is shown in Fig. 5. As this figure shows, four parallel 8-bit multipliers are needed to support both 8-by-8 and 16-by-16 multiplication operations.



Figure 5: Block diagram of MAC with SWP

#### D. Local Data Bus

In addition to the two two-port data buses, the CDSP has three local data buses, shown at the bottom of Fig. 2. These buses help keep related data local to the datapath. Together with the two 40-bit accumulators, they offer various input sources for the four datapath units.

#### E. Hardware Circular Buffers

Many DSP algorithms, such as digital filters, require circular data buffers. Each AG in CDSP has eight index registers and supports a modulo addressing mode, in which hardware handles the wrapping of the address index, to simplify the implementation of circular buffers.
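Modulo addressing can be sketched in software as follows. The class below is a hypothetical model with a single index register; its names (`base`, `length`, `step`) are illustrative assumptions and do not correspond to documented CDSP registers.

```python
class ModuloAddressGenerator:
    """Software model of one index register with modulo (circular) addressing."""

    def __init__(self, base, length, step=1):
        if length <= 0:
            raise ValueError("buffer length must be positive")
        self.base = base      # first address of the circular buffer
        self.length = length  # number of words in the buffer
        self.step = step      # post-increment applied after each access
        self.offset = 0       # current position inside the buffer

    def next_address(self):
        """Return the current address, then advance with wrap-around."""
        addr = self.base + self.offset
        # The wrap is what the hardware performs without extra instructions.
        self.offset = (self.offset + self.step) % self.length
        return addr


if __name__ == "__main__":
    ag = ModuloAddressGenerator(base=0x100, length=4)
    print([hex(ag.next_address()) for _ in range(6)])
```

Without hardware support, the wrap test would cost additional instructions inside every filter loop iteration.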

#### F. Zero-Overhead Looping

DSP algorithms are repetitive and are most logically expressed as loops. The program sequencer in the CDSP supports looped code with zero overhead, combining excellent performance with the clearest program structure.

#### G. Idle Mode for Power Management

For low-power operation, the power management of CDSP supports an idle mode that reduces power consumption when the CDSP is not operating.

# **3. SPECIAL INSTRUCTIONS**

Based on the WCDMA simulation results, several special instructions are designed for the key operations.

### 3.1 Channel Estimation for RAKE Combining

The main idea of channel estimation is to use the known pilot symbols to obtain the current mobile channel response. The obtained information can then be used for maximum-ratio combining, in the spirit of the RAKE receiver. This operation requires complex arithmetic such as multiplication and multiply-accumulation. On a normal 16-bit DSP, a complex multiplication, which contains four multiplications and two additions/subtractions, takes six clock cycles. Since CDSP supports both 8-bit and 16-bit data types with its SWP architecture, the four 8-bit multipliers of the MAC (shown in Fig. 5) allow it to perform a complex MUL/MAC in one instruction cycle. The result of a complex MUL/MAC is two 16-bit words, one for the real part and one for the imaginary part. These two words can be stored into the two 16-bit data memories or into one of the two 40-bit accumulators.
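The arithmetic behind a complex MUL/MAC can be written out as the following Python sketch, which makes the four multiplications and two additions/subtractions explicit. The function names are hypothetical, and the `rake_combine` helper assumes the usual maximum-ratio rule of weighting each finger by the conjugate of its channel estimate; the paper does not spell out the exact weighting.

```python
def complex_mul(a_re, a_im, b_re, b_im):
    """(a_re + j*a_im) * (b_re + j*b_im) using four multiplies and two add/sub."""
    p0 = a_re * b_re  # multiplier 1
    p1 = a_im * b_im  # multiplier 2
    p2 = a_re * b_im  # multiplier 3
    p3 = a_im * b_re  # multiplier 4
    return p0 - p1, p2 + p3  # real part, imaginary part


def complex_mac(acc, a, b):
    """Accumulate the complex product a*b into acc; all are (re, im) tuples."""
    re, im = complex_mul(a[0], a[1], b[0], b[1])
    return acc[0] + re, acc[1] + im


def rake_combine(fingers, channel_estimates):
    """Sum finger outputs weighted by conjugated channel estimates (assumed MRC rule)."""
    acc = (0, 0)
    for r, h in zip(fingers, channel_estimates):
        acc = complex_mac(acc, r, (h[0], -h[1]))  # multiply by conj(h)
    return acc


if __name__ == "__main__":
    print(complex_mul(3, 4, 1, -2))
    print(rake_combine([(3, 4), (1, 1)], [(1, 2), (2, 0)]))
```

On a datapath with four parallel multipliers, the four products are formed in the same cycle, which is what reduces the six-cycle sequence of a conventional single-multiplier DSP to one instruction.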

### 3.2 Viterbi Algorithm

Convolutional encoding is the most popular error correction coding method in wireless communication systems. Convolutionally encoded data is decoded by applying the Viterbi algorithm to the trellis diagram [8]. The Viterbi decoding process has two steps, as discussed in [9]. The first step is the metric update, which uses the ACS (Add-Compare-Select) operation. Since this step is the most time-consuming part of Viterbi decoding, some communication DSPs, such as the TI TMS320C54x [3], use special instructions to speed up the ACS operation. Even so, they still require five instruction cycles to complete one butterfly calculation. In CDSP, both the SIMD and the SWP architecture are used to accelerate the ACS operation, as shown in Fig. 6. The MAC bypasses its multipliers and behaves as a 24-bit subtractor/adder and a 16-bit adder/subtractor. The ALU likewise behaves as a 24-bit adder/subtractor and a 16-bit subtractor/adder. At the same time, the CMP selects the minimum of the previously calculated path distances and saves it to the data memories. Therefore, two ACS operations can be completed in one clock cycle. This is also the reason for separating the CMP from the ALU.



Figure 6: Data flow of dual ACS operations
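The dual ACS operation of one butterfly can be sketched in Python as below. This follows the common formulation for rate-1/2 codes, in which the two branches of a butterfly use metrics of equal magnitude and opposite sign; that symmetry, the state indexing, and all names are assumptions for illustration and are not taken from Fig. 6.

```python
def butterfly(old_a, old_b, bm):
    """One trellis butterfly: four add/subtracts followed by two compare-selects.

    old_a, old_b: path metrics of the two predecessor states.
    bm: branch metric; the opposite branch is assumed to use -bm.
    Returns (new_even, new_odd, decision_even, decision_odd).
    """
    c0 = old_a + bm  # the four candidate path metrics (the "add" step)
    c1 = old_b - bm
    c2 = old_a - bm
    c3 = old_b + bm
    # Compare-select keeps the minimum distance and records which branch won.
    new_even, dec_even = (c0, 0) if c0 <= c1 else (c1, 1)
    new_odd, dec_odd = (c2, 0) if c2 <= c3 else (c3, 1)
    return new_even, new_odd, dec_even, dec_odd


def metric_update(old, branch_metrics):
    """Update all path metrics for one trellis stage.

    Assumed indexing: states j and j + N/2 feed states 2j and 2j + 1.
    """
    half = len(old) // 2
    new = [0] * len(old)
    decisions = [0] * len(old)
    for j in range(half):
        e, o, de, do = butterfly(old[j], old[j + half], branch_metrics[j])
        new[2 * j], new[2 * j + 1] = e, o
        decisions[2 * j], decisions[2 * j + 1] = de, do
    return new, decisions


if __name__ == "__main__":
    print(metric_update([5, 4, 0, 6], [1, -1]))
```

Each call to `butterfly` reads two old metrics and produces two new ones, so executing one butterfly per clock cycle implies two memory reads and two memory writes in that cycle.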

Clearly, two memory reads and two memory writes must be completed per clock cycle. This is the main reason why we use two-port data memories in CDSP. According to the indexing of the metric data shown in Fig. 7(a), the corresponding exclusion graph is drawn on the left of Fig. 7(b), based on the two-port data memories. Following this exclusion graph, the metric data must be arranged as shown on the right side of Fig. 7(b). In this way, the data flow is smooth for the storage of each trellis diagram.



Figure 7(a): R=1/2, K=9 butterfly in trellis diagram



Figure 7(b): Data arrangement of the metric update

The next step of the Viterbi algorithm is the trace-back operation. Here we use the architecture in [10] and support constraint lengths from 5 to 9.

### 3.3 FIR Filtering

The SWP architecture of the MAC also speeds up the FIR filtering operation by a factor of two. When performing FIR filtering, the left two 8-bit multipliers, the middle-stage 32-bit adder below the crossbar, and the last-stage 40-bit accumulator of the MAC are active. The memory arrangement of the input data and coefficients is shown in Fig. 8. The input data sequence is stored in one data memory in packed form, and the coefficients are stored in the other data memory, also in packed form.
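The factor-of-two speedup can be illustrated with a Python sketch in which each loop iteration stands for one MAC cycle that consumes one packed data word and one packed coefficient word. The packing order and helper names are assumptions, and the sketch computes a single output sample for a window that is already aligned with the coefficients; it does not model how the sliding window is realigned between outputs.

```python
def to_signed8(x):
    """Interpret the low 8 bits of x as a two's-complement value."""
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x


def pack(hi, lo):
    """Pack two signed 8-bit values into one 16-bit word (assumed order: hi, lo)."""
    if not (-128 <= hi <= 127 and -128 <= lo <= 127):
        raise ValueError("values must fit in signed 8 bits")
    return ((hi & 0xFF) << 8) | (lo & 0xFF)


def unpack(word):
    """Split a 16-bit word into its two signed 8-bit halves."""
    return to_signed8(word >> 8), to_signed8(word)


def pack_sequence(samples):
    """Pack consecutive samples pairwise, zero-padding an odd-length sequence."""
    padded = list(samples) + [0] * (len(samples) % 2)
    return [pack(padded[k], padded[k + 1]) for k in range(0, len(padded), 2)]


def fir_output_packed(x_words, h_words):
    """One FIR output sample; each iteration models one two-tap MAC cycle."""
    acc = 0  # stands in for the 40-bit accumulator
    for xw, hw in zip(x_words, h_words):
        x_hi, x_lo = unpack(xw)
        h_hi, h_lo = unpack(hw)
        acc += x_hi * h_hi + x_lo * h_lo  # two 8-bit products summed, then accumulated
    return acc


if __name__ == "__main__":
    x = pack_sequence([1, -2, 3, 4])
    h = pack_sequence([4, 3, -2, 1])
    print(len(x), fir_output_packed(x, h))
```

For an N-tap filter the loop runs over N/2 packed words, which matches the N/2 cycle count listed for CDSP in Table 1.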





# **4. PERFORMANCE COMPARISON**

The CDSP has been compared with some well-known communication DSPs including TI C54x [3]/C55x [4], LODE [5], and MDSP-II [6]. Table 1 shows the performance comparison results of CDSP with these DSPs in terms of convolutional decoding (R=1/2) for GSM, IS-95, WCDMA (3G), and also FIR filtering and complex arithmetic operations.

| Operations            | TI C54x [3] (1 MAC) | TI C55x [4] (2 MACs) | LODE [5] (2 MACs) | MDSP-II [6] (2 MACs) | CDSP (1 MAC) |
|-----------------------|---------------------|----------------------|-------------------|----------------------|--------------|
| GSM Conv. (K=5)       | 0.58 MIPS           | 0.276 MIPS           | 0.44 MIPS         | 0.8 MIPS             | 0.093 MIPS   |
| IS-95 Conv. (K=9)     | 6.57 MIPS           | 3.13 MIPS            | 5.2 MIPS          | 9.01 MIPS            | 1.38 MIPS    |
| 3G Conv. @ 384 kbps   | 263 MIPS            | 121 MIPS             | 208 MIPS          | 360 MIPS             | 55 MIPS      |
| FIR Filtering (N-tap) | N cycles            | N/2 cycles           | N/2 cycles        | N/2 cycles           | N/2 cycles   |
| Complex MUL/MAC       | 4 cycles            | 2 cycles             | 2 cycles          | 2 cycles             | 1 cycle      |

Table 1: Performance comparison results

From Table 1, CDSP achieves about 4 to 8 times the performance on convolutional decoding and 2 to 4 times the performance on complex arithmetic. It also matches the performance of the DSPs that have two MAC units when performing FIR filtering. Therefore, for these key operations of wireless applications, CDSP consumes far fewer processor MIPS than the other communication DSPs.
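The ratios behind this comparison can be recomputed from Table 1 with the short Python sketch below. The table values are copied from Table 1; the `butterfly_only_mips` function is our own rough estimate, which assumes a rate-1/2 code with 2^(K-2) butterflies per decoded bit and counts only the butterfly cycles, ignoring branch metric computation, trace back, and loop overhead.

```python
# Values copied from Table 1 (MIPS for decoding, cycles for complex MUL/MAC).
TABLE1 = {
    "GSM Conv. (K=5)": {"TI C54x": 0.58, "TI C55x": 0.276, "LODE": 0.44, "MDSP-II": 0.8, "CDSP": 0.093},
    "IS-95 Conv. (K=9)": {"TI C54x": 6.57, "TI C55x": 3.13, "LODE": 5.2, "MDSP-II": 9.01, "CDSP": 1.38},
    "3G Conv. @ 384 kbps": {"TI C54x": 263, "TI C55x": 121, "LODE": 208, "MDSP-II": 360, "CDSP": 55},
    "Complex MUL/MAC": {"TI C54x": 4, "TI C55x": 2, "LODE": 2, "MDSP-II": 2, "CDSP": 1},
}


def speedups(row):
    """Ratio of each DSP's cost to the CDSP cost for one table row."""
    return {name: value / row["CDSP"] for name, value in row.items() if name != "CDSP"}


def butterfly_only_mips(bit_rate_bps, constraint_length, cycles_per_butterfly):
    """Rough estimate: MIPS spent on butterflies alone (assumption, see lead-in)."""
    butterflies_per_bit = 2 ** (constraint_length - 2)
    return bit_rate_bps * butterflies_per_bit * cycles_per_butterfly / 1e6


if __name__ == "__main__":
    for operation, row in TABLE1.items():
        print(operation, {k: round(v, 2) for k, v in speedups(row).items()})
    print(butterfly_only_mips(384_000, 9, 1), butterfly_only_mips(384_000, 9, 5))
```

With one butterfly per cycle at 384 kbps and K=9, the butterfly-only estimate is about 49.2 MIPS, and with five cycles per butterfly it is about 245.8 MIPS. Both lie somewhat below the 55 MIPS and 263 MIPS reported for CDSP and the TI C54x in Table 1, which would be consistent with the remaining decoder overhead, although the paper does not break the figures down this way.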

# **5. CHIP IMPLEMENTATION**

A prototype chip has been implemented using a cell-based design approach in TSMC 0.35 µm 1P4M CMOS technology. The detailed chip specification is shown in Table 2. Fig. 9 shows the chip microphotograph.

Table 2: Features of the CDSP prototype chip

| Feature            | Specification                                      |
|--------------------|----------------------------------------------------|
| Package            | 208 CQFP                                           |
| Die Size           | 5.6 x 5.6 mm<sup>2</sup>                           |
| Max Clock Rate     | 40 MHz (@ 3.3 V)                                   |
| Pipeline           | Five-stage                                         |
| Transistor Count   | 480K (excluding memories)                          |
| On-Chip Memory     | One 1K x 28-bit (program)<br>Two 2K x 16-bit (data) |
| Design Methodology | Cell-based design                                  |



Figure 9: Chip microphotograph

# **6. CONCLUSION**

In this paper, an application-specific digital signal processor is presented for third generation wireless communications. The DSP is specially designed for symbol-rate data processing in the WCDMA system and outperforms other communication DSPs on several key operations of wireless applications. We believe that the proposed DSP architecture will be useful in 3G wireless systems, which support much higher data rates than 2G systems.

# REFERENCES

- [1] R. Prasad, "An overview of CDMA evolution toward wideband CDMA," *IEEE Communications Surveys*, vol. 1, no. 1, pp. 2–29, 1998.
- [2] E. Dahlman, P. Beming, J. Knutsson, F. Ovesjö, M. Persson, and C. Roobol, "WCDMA-the radio interface for future mobile multimedia communications," *IEEE Transactions on Vehicular Technology*, vol. 47, pp. 1105–1118, Nov. 1998.
- [3] W. Lee, P. E. Landman, B. Barton, S. Abiko, H. Takahashi, H. Mizuno, S. Muramatsu, K. Tashiro, M. Fusumada, L. Pham, F. Boutaud, E. Ego, G. Gallo, H. Tran, C. Lemonds, A. Shih, M. Nandakumar, R. H. Eklund, and I-C. Chen, "A 1-V Programmable DSP for Wireless Communications," *IEEE Journal of Solid-State Circuits*, vol. 32, pp. 1766–1776, 1997.
- [4] "TMS320C55x DSP Core Technical CPU Brief," Technical Report, Texas Instruments, 2000.
- [5] I. Verbauwhede and M. Touriguian, "A Low Power DSP Engine For Wireless Communications," *Journal of VLSI Signal Processing*, vol. 18, 1998.
- [6] B.-W. Kim, J.-H. Yang, C.-S. Hwang, Y.-S. Kwon, K.-M. Lee, I.-H. Kim, Y.-H. Lee, and C.-M. Kyung, "MDSP-II: A 16-Bit DSP with Mobile Communication Accelerator," *IEEE Journal of Solid-State Circuits*, vol. 34, pp. 397–404, 1999.
- [7] C.-K. Chen, P.-C. Tseng, Y.-C. Chang, and L.-G. Chen, "A Digital Signal Processor with Programmable Correlator Array Architecture for 3rd Generation Wireless Communication System," submitted to *IEEE Transactions on Circuits and Systems II*.
- [8] G. D. Forney, "The Viterbi algorithm," *Proceedings of the IEEE*, pp. 268–278, 1973.
- [9] H. Hendrix, "Viterbi Decoding Techniques in the TMS320C54X Family," Application Report, Texas Instruments, 1996.
- [10] T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, M. Okamoto, H. Kabuo, and H. Taki, "W-CDMA hardware-related issues," *International Conference on Communication Technology*, pp. S22-05-1–S22-05-5, 1998.

# BIOGRAPHY



**Po-Chih Tseng** was born in Tao-Yuan, Taiwan in 1977. He received the B.S. degree in Electrical and Control Engineering from National Chiao Tung University in 1999 and the M.S. degree in Electrical Engineering from National Taiwan University in 2001. He is currently pursuing the Ph.D. degree at the Institute of Electronics Engineering, National Taiwan University. His research interests include digital signal processor design, architecture design for image and video coding systems, and reconfigurable computing systems for multimedia applications.



**Chi-Kuang Chen** was born in Taipei, Taiwan in 1976. He received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University in 1998 and 2000, respectively. In 2000, he joined the DSP group of VIVOTEK Inc., and is currently engaged in audio codec programming on DSP processors.

**Liang-Gee Chen** was born in Yun-Lin, Taiwan, in 1956. He received the B.S., M.S., and Ph.D. degrees in Electrical Engineering from National Cheng Kung University in 1979, 1981, and 1986, respectively.

He was an Instructor (1981–1986) and an Associate Professor (1986–1988) in the Department of Electrical Engineering, National Cheng Kung University. During his military service in 1987 and 1988, he was an Associate Professor in the Institute of Resource Management, Defense Management College. In 1988, he joined the Department of Electrical Engineering, National Taiwan University. From 1993 to 1994 he was a Visiting Consultant at the DSP Research Department, AT&T Bell Lab, Murray Hill. In 1997, he was a visiting scholar at the Department of Electrical Engineering, University of Washington, Seattle. Currently, he is a Professor at National Taiwan University. His current research interests are DSP architecture design, video processor design, and video coding systems.

Dr. Chen is a Fellow of IEEE. He is also a member of the honor society Phi Tau Phi. He was the general chairman of the 7th VLSI Design/CAD Symposium and of the 1999 IEEE Workshop on Signal Processing Systems: Design and Implementation. He has served as Associate Editor of IEEE Trans. on Circuits and Systems for Video Technology since June 1996, of IEEE Trans. on VLSI Systems since January 1999, and of the Journal of Circuits, Systems, and Signal Processing since 1999. He served as the Guest Editor of The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology. He is also the Associate Editor of the IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing.

Dr. Chen received the Best Paper Award from the ROC Computer Society in 1990 and 1994. From 1991 to 1999, he received Long-Term (Acer) Paper Awards annually. In 1992, he received the Best Paper Award of the 1992 Asia-Pacific Conference on Circuits and Systems in the VLSI design track. In 1993, he received the Annual Paper Award of the Chinese Engineer Society. In 1996, he received the Outstanding Research Award from NSC and the Dragon Excellence Award for Acer. He was elected an IEEE Circuits and Systems Distinguished Lecturer for 2001–2002.